Skip to content

Conversation

@xichen01
Copy link
Contributor

What changes were proposed in this pull request?

Let the audit log record the detailed execution time

  • Time-consuming interaction with OM: metadataLatencyMs/...
  • Complete read and write elapsed time: createKeyLatencyMs/getKeyLatencyMs/ getBucketKeyCount/...
  • Read/write length, number of lists: putLength/getLength/copyLength/...

Example:

2023-11-23 19:52:20,703 | INFO  | S3GAudit | user=ozone | ip=[0:0:0:0:0:0:0:1] | op=CREATE_KEY {bucket=[pony-bucket1], path=[15-2M.img]} performance={metadataLatencyMs=37, putLength=52428800, createKeyLatencyMs=1744} | ret=SUCCESS |
2023-11-23 19:52:32,020 | INFO  | S3GAudit | user=ozone | ip=[0:0:0:0:0:0:0:1] | op=GET_KEY {bucket=[pony-bucket1], path=[15-2M.img]} performance={metadataLatencyMs=69, readLength=52428800, getKeyLatencyMs=1806} | ret=SUCCESS |
2023-11-23 19:52:47,366 | INFO  | S3GAudit | user=ozone | ip=[0:0:0:0:0:0:0:1] | op=GET_BUCKET {bucket=[pony-bucket1], list-type=[2], encoding-type=[url], prefix=[]} performance={getBucketKeyCount=1, getBucketLatencyMs=67} | ret=SUCCESS |

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9718

How was this patch tested?

Manual test, Existing test

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xichen01 for the patch.

I'm not sure audit log was intended for such purposes, but since every operation is (supposed to be) logged there, including per-operation performance metrics may make sense.

@adoroszlai adoroszlai requested a review from szetszwo November 23, 2023 17:29
Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xichen01 , thanks for working on this! It is inefficient to first construct a Map and then convert it to a String. We should build the String directly; see the comment inlined.

BTW, the change is quite long. See if you could split this into two tasks (optional).

// Set to an Array so that S3GAction can be changed within the function.
final S3GAction[] s3GAction = new S3GAction[] {S3GAction.CREATE_KEY};
boolean auditSuccess = true;
Map<String, String> perf = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a PerformanceStringBuilder to build the performance string instead of a Map.

//AuditLogger.java
  public static class PerformanceStringBuilder {
    private final StringBuilder builder = new StringBuilder(1024).append('[');

    public void appendNanos(String name,long nanos){
      append(name, TimeUnit.NANOSECONDS.toMillis(nanos));
    }

    public void append(String name,long value){
      builder.append(name)
          .append('=')
          .append(value)
          .append(", ");
    }

    public String build() {
      final int length = builder.length();
      if (length < 2) {
        return "[]";
      }
      builder.setCharAt(length - 2, ']');
      builder.setLength(length - 1);
      return builder.toString();
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been implemented

@adoroszlai adoroszlai requested a review from szetszwo November 26, 2023 17:48
@adoroszlai
Copy link
Contributor

Thanks @xichen01 for updating the patch.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@QueryParam("uploadId") @DefaultValue("") String uploadID,
InputStream body) throws IOException, OS3Exception {
long startNanos = Time.monotonicNowNanos();
// Set to an Array so that S3GAction can be changed within the function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems out-dated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


//ObjectEndpoint
CREATE_MULTIPART_KEY,
CREATE_MULTIPART_KEY_BY_COPY,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it COPY_MULTIPART_KEY?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referencing the AWS API name

CREATE_MULTIPART_KEY: AWS API name UploadPart
CREATE_MULTIPART_KEY_BY_COPY: AWS API name UploadPartCopy

I think it's a better match, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let's keep it then.

@adoroszlai
Copy link
Contributor

@duongkame @kerneltime would you like to take a look?

Copy link
Contributor

@duongkame duongkame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure audit log was intended for such purposes, but since every operation is (supposed to be) logged there, including per-operation performance metrics may make sense.

Indeed, not every operation is logged by default, but just the write ones.

What is the use of the metrics in individual log entries? I guess it may come from the fact that our metrics only support avg for now and outliners such as p90, p99 are missed. But isn't it fixed as per #5627.

getMetrics().incListKeyCount(keyCount);
perf.appendCount(keyCount);
perf.appendOpLatencyNanos(opLatencyNs);
AUDIT.logReadSuccess(buildAuditMessageForSuccess(s3GAction,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AUDIT may not log reads because of performance, but the perf does always build(). Should the performance data be passed as a lamba instead, so that a new string is not actually built when read audi logs are not enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a problem that deserves attention
We currently generate AuditMessages via buildAuditMessageForSuccess, etc. In the AuditMessageForSuccess we perform the build of all relevant messages, regardless of whether or not the logs are logged.
This is probably not just a perf problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the problem also applies to the existing, "functional" part of the audit message.

Audit logging is enabled by default, and we need several changes to improve it, so I think we should do it separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reopened HDDS-2638 for the improvement of audit message processsing/allocations.

@xichen01
Copy link
Contributor Author

Indeed, not every operation is logged by default, but just the write ones.

by default, the audit log is enabled to log "read and write" operations
hadoop-ozone/dist/src/shell/conf/s3g-audit-log4j2.properties

What is the use of the metrics in individual log entries? I guess it may come from the fact that our metrics only support avg for now and outliners such as p90, p99 are missed. But isn't it fixed as per #5627.

Part of the reason is yes, more in order to record the processing of S3G in more detail, when some users react to some key get/put/list timeout, slower, we can based on the key provided by the user, we know exactly how long each step takes, which is very helpful in locating the performance problem.

The time taken to perform each step for a particular key is difficult to obtain from the Metrics.

@adoroszlai adoroszlai merged commit b59366d into apache:master Nov 28, 2023
@adoroszlai
Copy link
Contributor

Thanks @xichen01 for the patch, @duongkame, @guohao-rosicky, @szetszwo for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants